Method and apparatus for speech recognition

ABSTRACT

A method and apparatus for speech recognition of the present application include a process of collating, with an input utterance, an acoustic model corresponding to a hypothesis that is expressed by connecting utterance segments, such as phonemes or syllables, and that is developed according to the length of the input utterance by an inter-word connection rule, thereby obtaining a recognition score. Within a word of the hypothesis, all similar hypotheses whose score lies within a predetermined threshold from the maximum score are held up to the word end, irrespective of the number of hypotheses. At a word end, meanwhile, the hypotheses are narrowed to a predetermined number of top-ranking hypotheses in descending order of score.

FIELD OF THE INVENTION

[0001] The present invention relates to an art of speech recognition to be mounted on general industrial and home-use electric appliances and, more particularly, to a method and apparatus for speech recognition improved in speech recognition rate.

BACKGROUND OF THE INVENTION

[0002] Conventionally, there has been a method and apparatus for speech recognition described in, e.g., “Hermann Ney: Data Driven Search Organization for Continuous Speech Recognition (IEEE TRANSACTIONS ON SIGNAL PROCESSING Vol. 40 No. 2 p272 1992)”.

[0003] FIG. 8 is a process flow of a speech recognition system as a related art. The process steps shown in the figure are executed synchronously with the frames of an input utterance. By executing them to the end of the input utterance, a hypothesis approximate to the input utterance is obtained as a result of recognition. A search employing such a method is referred to as a frame synchronization beam search. Each of the steps is explained below.

[0004] Using the one-pass search algorithm, a hypothesis established on the i-th frame of an input utterance is developed in the (i+1)-th frame. If the hypothesis is within a word, the utterance segments expressing the word are extended. Otherwise, if the hypothesis is at a word end, a word to follow is joined according to an inter-word connection rule, which extends the first utterance segment of that word. The hypothesis on the i-th frame is erased so that only the (i+1)-th hypothesis is stored (step S801).

[0005] Next, among the hypotheses developed in the (i+1)-th frame by step S801, the hypothesis highest in the score accumulated up to the (i+1)-th frame (hereinafter referred to as the cumulative score) is taken as a reference. Only the hypotheses having a score within a constant threshold of that reference score are stored, and all other hypotheses are erased. This is referred to as narrowing the candidates. The narrowing prevents the number of hypotheses from increasing exponentially and the computation from becoming infeasible (step S802).

[0006] Next, the process moves to the next frame by incrementing the current frame i by one. At this time, it is determined whether or not the frame is the last frame. If it is the last frame, the process is ended. If it is not the last frame, the process returns to step S801 (step S803).

[0007] As in the foregoing, the related-art method narrows down the hypothesis candidates depending only upon whether the cumulative score is within a threshold or not.
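The related-art narrowing of step S802 can be sketched as follows. This is a minimal Python illustration, not code from the cited reference: the function name `narrow_by_threshold`, the dictionary representation of hypotheses, and the choice to keep only scores strictly above the cutoff are assumptions made for the example.

```python
def narrow_by_threshold(hypotheses, threshold):
    """Keep hypotheses whose cumulative score lies within `threshold` of the best.

    `hypotheses` maps a hypothesis label to its cumulative score (higher is
    better). Treating a score exactly at the cutoff as erased is an assumption.
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    cutoff = best - threshold
    return {label: score for label, score in hypotheses.items() if score > cutoff}


# Hypothetical scores for three hypotheses in one frame.
survivors = narrow_by_threshold({"h1": 12.0, "h2": 8.0, "h3": 10.0}, threshold=3.0)
```

Because the same rule is applied everywhere, the number of survivors at a word end is bounded only by the threshold, which is the weakness discussed in the following paragraphs.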

[0008] Incidentally, there is, e.g., Japanese Patent Laid-Open No. 6588/1996 as a speech recognition method to accurately evaluate hypotheses in the frame synchronization beam search. The speech recognition method described in this publication performs normalization against time in the frame synchronization beam search. Namely, a likelihood function common to all the hypotheses is subtracted from the score of each hypothesis at time t. Then, the maximum value of the normalized scores is stored, together with the hypotheses whose normalized scores are within a constant threshold of that maximum value.

[0009] In the related-art speech recognition system, however, a hypothesis, whether within a word or at a word end, is stored whenever its score is within a constant threshold of the highest cumulative score, as noted above. Consequently, at a word end there are a number of connectable word candidates to follow, which greatly increases the number of hypotheses. As a result, there has been the drawback that the computation for selecting hypothesis candidates becomes difficult.

[0010] The present invention has been made to solve this problem. It is an object to provide a method and apparatus for speech recognition capable of effectively reducing the computation amount in selecting hypothesis candidates while securing the accuracy of speech recognition.

SUMMARY OF THE INVENTION

[0011] A method for speech recognition according to the present invention for solving the problem includes, in a frame synchronization beam search, a process of, within a candidate word, retaining up to the word end the similar hypotheses high in acoustic score irrespective of the number of hypotheses and, at an end of a candidate word, narrowing down the number of hypotheses. Namely, the method for speech recognition comprises: a feature-amount extracting step for extracting a feature amount based on a frame of an input utterance; a storing step for determining whether a current processing frame is at an end of or within a candidate word previously registered, and storing the candidate word on the basis of a first hypothesis-storage determining criterion when at a word end and on the basis of a second hypothesis-storage determining criterion when within a word; a developing step for developing a hypothesis by extending utterance segments expressing the word when a stored candidate word is within a word and by joining a word to follow according to an inter-word connection rule when at a word end; an operating step of computing a similarity between the feature amount extracted from the input utterance and a frame-based feature amount of an acoustic model of the developed hypothesis, and calculating a new recognition score from the similarity and a recognition score of the hypothesis up to an immediately preceding frame calculated from the similarity; and a step of repeating the storing step, the developing step and the operating step until the processing frame becomes a last frame of the input utterance, and outputting, as a recognition result approximate to the input utterance, at least one of the hypotheses in descending order of the recognition score resulting from processing the last frame.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 shows a system configuration diagram showing a speech recognition apparatus in an embodiment of the present invention;

[0013] FIG. 2 shows a block diagram of a hardware configuration of a speech recognition processing section in the embodiment of the invention;

[0014] FIG. 3 shows a block diagram of a functional configuration of the speech recognition processing section in the embodiment of the invention;

[0015] FIG. 4 is a flowchart showing a process procedure of the speech recognition processing section in the embodiment of the invention;

[0016] FIG. 5 shows an explanatory figure on a set of candidate words to be first registered and recognition scores thereof in the embodiment of the invention;

[0017] FIG. 6 shows a process progress diagram for hypothesis determination in the embodiment of the invention;

[0018] FIG. 7 shows an example of an inter-word connection rule in the embodiment of the invention;

[0019] FIG. 8 is a flowchart showing a process procedure by a related art.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0020] The embodiments of the present invention will now be explained with reference to the drawings.

[0021] FIG. 1 shows an example of a configuration diagram of a speech recognition apparatus in an embodiment of the invention.

[0022] In FIG. 1, the speech recognition apparatus includes a microphone 101, a speech-recognition processing section 102, an external storage unit 103 and an output unit 104.

[0023] The microphone 101 captures an utterance spoken by a user and is integral with the speech recognition apparatus. Note that the microphone 101 need not necessarily be in one body with the speech recognition apparatus.

[0024] The speech-recognition processing section 102, when detecting an input of an utterance through the microphone 101, performs processing to recognize a word uttered by the user from among the candidate words subject to speech recognition.

[0025] The external storage unit 103 stores a processing program to be executed in the speech-recognition processing section 102.

[0026] The output unit 104 is a liquid crystal panel that displays the word or text recognized by the speech-recognition processing section 102.

[0027] An outline of the operation of the present embodiment is now explained.

[0028] The speech recognition apparatus, when power is turned on, loads a processing program as a data signal SIG2 from the external storage unit 103 to the speech-recognition processing section 102. The processing program is executed after being stored in a main storage section of the speech-recognition processing section 102. Then, the speech-recognition processing section 102 receives, through the microphone 101, an utterance signal SIG1 of the user's utterance words to be recognized and stores it in the main storage section of the speech-recognition processing section 102. The user's utterance words may be a word or a text consisting of a plurality of sentences. Next, the speech-recognition processing section 102 performs a recognition process on the input utterance in order from its beginning according to the processing program, and displays the closest-matched word or text from among the candidate words on the output unit 104 controlled by a signal SIG3.

[0029] An example of a hardware configuration of the speech-recognition processing section 102 is now explained with reference to FIG. 2.

[0030] The speech-recognition processing section 102 includes an A/D converter 201 to convert the analog signal inputted from the microphone 101 into a digital signal, a main storage section 202 to store data and processing programs, an information processing section 203 to process data according to the program, an acoustic model 204 configured with a plurality of frames modeled with the acoustic features based on utterance segments such as phonemes and syllables to express the word as a subject of recognition, a language model 205 describing a connection rule between the words for recognition, a word lexicon 206 registered with candidate-word sets, an inter-word connection rule 209 recording the list of words to follow a certain word, a DMA (Direct Memory Access) 207 for transferring the processing program at a high rate from the external storage unit 103 to the main storage section 202, and a PIO (Parallel I/O) 208 for bidirectional parallel communication between the external storage unit 103 and the output unit 104 and for delivering data synchronously onto a bus. Note that, in the figure, the devices 201 to 209 are connected through the bus. Next, the functional block configuration of the speech-recognition processing section 102, realized by the hardware configuration noted above, is explained with reference to FIG. 3.

[0031] The storage section 301 temporarily stores input utterance data, feature amount vectors, candidate words and so on. The feature amount extracting section 302 extracts a feature amount of the utterance from the input utterance. An intra-word word end determining section 303 determines whether a hypothesis is within a word or at a word end. An intra-word hypothesis storage determining section 304 determines whether or not to store a hypothesis candidate word on the basis of an utterance-based recognition score. A word-end hypothesis storage determining section 305 determines whether or not to store a hypothesis on the basis of the number of hypothesis candidate words. A search control section 306 extends the utterance segments expressing a word if the hypothesis is within a word, and joins a word to follow in compliance with the inter-word connection rule described in the language model 205 when at a word end. Thus, the search control section 306 carries out development control of the hypothesis in a frame synchronization beam search. A similarity computing section 307 computes a similarity between a frame-based feature amount of the input utterance outputted from the feature amount extracting section 302 and the acoustic model 204. A search operating section 308 computes a recognition score from the similarity computed by the similarity computing section 307 and the recognition score of the hypothesis up to the immediately preceding frame. The hypothesis updating section 309 updates the hypothesis and the computed recognition score. A speech-recognition end determining section 310 determines whether or not the process has been completed up to the end of the input utterance data stored in the storage section 301. A recognition result outputting section 311 continues the frame synchronization beam search to the end of the input utterance and outputs, as a recognition result, an outputtable hypothesis high in recognition score.

[0032] FIG. 4 is a flowchart showing a data process flow by the functional blocks in the speech-recognition processing section 102. The data process flow is explained using the flowchart.

[0033] In the figure, S denotes a process step, and each process step is realized by a functional block of FIG. 3.

[0034] At first, the entire utterance signal spoken by the user is temporarily stored in the storage section 301 in frames of 10 ms (step S401).
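As a small illustration of the 10 ms framing in step S401, the following Python sketch splits a sampled signal into consecutive frames. The sampling rate of 16 kHz, the non-overlapping frames, the dropping of a trailing partial frame, and the name `split_into_frames` are all assumptions; the embodiment states only the frame length.

```python
def split_into_frames(samples, sample_rate_hz=16000, frame_ms=10):
    """Split `samples` into consecutive, non-overlapping frames of `frame_ms`.

    A trailing partial frame is dropped (an assumption; the text does not say
    how the final samples are handled).
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # 160 samples at 16 kHz
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


# 400 dummy samples yield two full frames; the remaining 80 samples are dropped.
frames = split_into_frames(list(range(400)))
```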

[0035] Next, when the utterance input is detected, an initial set of hypotheses including candidate words previously registered and recognition scores having an initial value of ‘0’ is copied from the word lexicon 206 and stored in the storage section 301 (step S402). The present embodiment stores an initial set of hypotheses including the candidate words and recognition scores as shown in FIG. 5. FIG. 5 is an example storing five words 501, i.e. “arrow”, “after”, “access”, “accept” and “eel”, and the respective recognition scores 502 (initial value ‘0’). Coined words having no linguistic meaning may also be registered as candidate words.

[0036] Then, the feature amount extracting section 302 conducts LPC cepstrum analysis on all the accumulated frames, only the first time after the utterance, and extracts LPC cepstral coefficient vectors, storing them again in the storage section 301. From then on, the LPC cepstral coefficient vectors are read out of the storage section 301 as recognition proceeds sequentially (step S403). Although LPC cepstral coefficient vectors are used as the feature amount to be extracted, a similar effect is available with other acoustic parameters such as MFCC (mel frequency cepstral coefficients).

[0037] Next, the intra-word word end determining section 303 determines whether the utterance segment currently being processed is within a word or at a word end (step S404). At the beginning of a user's utterance, the utterance segment is assumed to be within a word. Elsewhere in the utterance, when the current processing frame of a hypothesis is within a word rather than at a word end, the intra-word hypothesis storage determining section 304 takes as a reference the intra-word hypothesis highest in recognition score among the current candidate words, and narrows the hypotheses down to the intra-word hypotheses having recognition scores within a constant threshold of that reference score (step S405). Where the hypothesis is at a word end, the word-end hypothesis storage determining section 305 selects hypotheses from the current candidate words in descending order of recognition score, narrowing the hypotheses to a predetermined number (step S406).
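Steps S404 to S406 can be sketched in Python as a single storing function that applies a different criterion depending on where each hypothesis stands. This is an illustrative sketch only: the `Hypothesis` class, the function name `store_hypotheses`, and the sample scores are hypothetical, and erasing a score exactly at the cutoff is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    label: str
    score: float       # recognition score, higher is better
    at_word_end: bool  # result of the step S404 determination


def store_hypotheses(hypotheses, threshold, max_word_end):
    """Apply the intra-word criterion (S405) and the word-end criterion (S406)."""
    within = [h for h in hypotheses if not h.at_word_end]
    ends = [h for h in hypotheses if h.at_word_end]

    kept = []
    if within:
        # S405: keep every intra-word hypothesis close to the best intra-word
        # score, however many there are.
        cutoff = max(h.score for h in within) - threshold
        kept.extend(h for h in within if h.score > cutoff)
    # S406: keep only a fixed number of word-end hypotheses, best first.
    kept.extend(sorted(ends, key=lambda h: h.score, reverse=True)[:max_word_end])
    return kept


# Hypothetical frame with three intra-word and three word-end hypotheses.
frame = [
    Hypothesis("w1", 10, False), Hypothesis("w2", 6, False), Hypothesis("w3", 8, False),
    Hypothesis("e1", 5, True), Hypothesis("e2", 9, True), Hypothesis("e3", 7, True),
]
kept_labels = [h.label for h in store_hypotheses(frame, threshold=3, max_word_end=2)]
```

The intra-word branch has no upper bound on the number of survivors, while the word-end branch is bounded by `max_word_end` regardless of how close the scores are.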

[0038] Then, the search control section 306 extends the utterance segments expressing a word if the narrowed hypothesis is within a word, and joins a word to follow according to the inter-word connection rule 209 if it is at a word end, thus developing it as a new hypothesis candidate word (step S407).

[0039] Then, the similarity computing section 307 computes a similarity for the developed hypothesis from a feature amount of the currently processed frame of the input utterance and a feature amount of the phonemes as utterance segments of a selected candidate word 501. The search operating section 308 adds together the similarity and the recognition score of the hypothesis up to the immediately preceding frame, thereby determining a recognition score (step S408). These processes are called a frame synchronization beam search operation. Note that the feature amount of a candidate word is extracted from the acoustic model 204 as a set of phoneme-based acoustic parameters. In the embodiment, the statistical distance measure expressed in Equation (1) is used as the similarity. From the similarity L(i, j), an acoustic score is determined by Equation (2).

[0040] In Equation (2), as(i, j) is the acoustic score at an input-utterance frame i and an acoustic-model lexicon frame j.

$$L(i,j) = (x(i)-\mu(j))^{t}\,\Sigma(j)^{-1}\,(x(i)-\mu(j)) + \log|\Sigma(j)| \qquad (1)$$

$$\mathrm{as}(i,j) = |L(i,j)| \qquad (2)$$

[0041] where “t” denotes a transpose, “−1” denotes an inverse matrix, x(i) is an input vector corresponding to an input frame i, and Σ(j) and μ(j) are a covariance matrix and a mean-value vector of a feature vector corresponding to the lexicon frame j. Concretely, the foregoing acoustic model is a set of covariance matrices and mean-value vectors for these lexicon frames. The input vector, in the embodiment, is an LPC cepstral coefficient vector, that is, the feature vector extracted from the input utterance. The lexicon frame is likewise a feature vector, extracted from the acoustic model, of a word registered in the word lexicon that is considered to correspond to the input frame.
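Equations (1) and (2) can be evaluated numerically as in the following Python sketch. It uses only the standard library and computes the quadratic form and the log-determinant through a Cholesky factorization, which is a choice made for this example rather than something stated in the embodiment. The function names and the sample vectors are hypothetical, and Σ(j) is assumed to be symmetric positive definite.

```python
import math


def cholesky(sigma):
    """Return lower-triangular C with sigma = C C^t (sigma symmetric positive definite)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for k in range(r + 1):
            s = sum(c[r][m] * c[k][m] for m in range(k))
            if r == k:
                c[r][k] = math.sqrt(sigma[r][r] - s)
            else:
                c[r][k] = (sigma[r][k] - s) / c[k][k]
    return c


def distance(x, mu, sigma):
    """Equation (1): (x - mu)^t sigma^-1 (x - mu) + log|sigma|."""
    c = cholesky(sigma)
    d = [xi - mi for xi, mi in zip(x, mu)]
    # Forward substitution solves C y = d, so y . y equals the quadratic form.
    y = []
    for r in range(len(d)):
        y.append((d[r] - sum(c[r][m] * y[m] for m in range(r))) / c[r][r])
    quadratic = sum(v * v for v in y)
    log_det = 2.0 * sum(math.log(c[r][r]) for r in range(len(d)))
    return quadratic + log_det


def acoustic_score(x, mu, sigma):
    """Equation (2): as(i, j) = |L(i, j)|."""
    return abs(distance(x, mu, sigma))


# Hypothetical two-dimensional frame pair: |sigma| = 3 and the quadratic form is 2/3.
example = acoustic_score([1.0, 0.0], [0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]])
```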

[0042] Next, the hypothesis updating section 309 updates the developed hypothesis together with the new recognition score (step S409).

[0043] The process from step S404 to step S409 is explained using FIG. 6.

[0044] In this embodiment, the constant threshold from the maximum recognition score, used as the determination criterion for an intra-word hypothesis, is set to ‘3’, and the number of top-ranking recognition scores, used as the determination criterion for a word-end hypothesis, is set to ‘2’. Note that the numeral in each circle represents a determined recognition score.

[0045] In FIG. 6, the five words stored by step S402 are processed frame by frame. At time t, the recognition score of the hypothesis extended from the word top by ‘e’ was ‘12’ and that of the hypothesis extended by ‘i’ was ‘8’. Because the recognition score of ‘i’ is equal to or less than the threshold (12−3=9), the candidate word “eel” is deleted from the candidates. Then, the four words other than “eel” are left to continue the process. At time t+t1, the recognition score of the candidate word “after” is equal to or less than the threshold (24−3=21), and the word is hence deleted. “arrow”, “access” and “accept” are left, and the process is continued. At time t+t3, the recognition score ‘35’ of “arrow” is equal to or less than the threshold (45−3=42), and the word is hence deleted. “access” and “accept” are left to continue the process. At time t+t5, the remaining two words come to an end. Because both are within the top two in ranking, “access” and “accept” are both left.
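The deletion decisions quoted above can be checked with a few lines of Python. Only the numbers stated in the text are used (the threshold ‘3’, the scores 12 and 8 at time t, and 45 and 35 at time t+t3); the score of “after” at time t+t1 is not given and is therefore left out. The helper name `is_deleted` is hypothetical.

```python
THRESHOLD = 3  # intra-word threshold of the embodiment


def is_deleted(score, best, threshold=THRESHOLD):
    """A hypothesis is deleted when its score is equal to or less than best - threshold."""
    return score <= best - threshold


# (time label, best score, candidate word, candidate score) as stated in the text.
stated = [("t", 12, "eel", 8), ("t+t3", 45, "arrow", 35)]
decisions = [
    (time, word, best - THRESHOLD, is_deleted(score, best))
    for time, best, word, score in stated
]
# decisions holds the cutoff (9 and 42) and the deletion flag for each stated case.
```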

[0046] Next, from the inter-word connection rule 209, the candidate words to follow these candidate words are taken to provide new hypothesis candidates. This example is explained using FIG. 7.

[0047] FIG. 7 is an example of the inter-word connection rule. “of” and “to” are each registered as a word to follow the word “access” left as a candidate in the embodiment, and “a”, “the” and “your” are registered as words to follow the word “accept”. These five words are extracted as new candidate words and the hypothesis is updated. Then, the process returns again to step S403.
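The word-end development described for FIG. 7 can be sketched in Python with the connection rule held as a dictionary. The entries of `CONNECTION_RULE` are the ones listed in the text; the function name, the tuple layout of a hypothesis, the placeholder scores of 0.0, and the decision to carry the score forward unchanged are assumptions made for this illustration.

```python
# Inter-word connection rule as listed for FIG. 7: word -> words allowed to follow.
CONNECTION_RULE = {
    "access": ["of", "to"],
    "accept": ["a", "the", "your"],
}


def join_following_words(word_end_hypotheses, rule=CONNECTION_RULE):
    """Create one new hypothesis per word allowed to follow each word-end hypothesis.

    Each hypothesis is a (word sequence, score) pair. A word with no entry in
    the rule produces no new hypothesis.
    """
    developed = []
    for words, score in word_end_hypotheses:
        for following in rule.get(words[-1], []):
            developed.append((words + [following], score))
    return developed


# The two survivors at the word end; their scores are placeholders.
new_hypotheses = join_following_words([(["access"], 0.0), (["accept"], 0.0)])
```

With two word-end survivors the rule yields five new candidates, which shows how limiting the number of word-end hypotheses directly limits the fan-out of the next step.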

[0048] Note that FIG. 6 in the embodiment describes only the phoneme-based candidate narrowing process. In the actual process, however, a similar candidate narrowing process is conducted frame by frame, since a phoneme is composed of a plurality of frames.

[0049] In FIG. 4, the speech-recognition end determining section 310 determines whether or not the above process has been completed to the end of the input utterance stored in the storage section 301. Until the end-determining condition is satisfied, the frame-based process of steps S403 to S409 is repeated (S410).

[0050] Next, the recognition result outputting section 311 outputs to the output unit 104, as a recognition result, an outputtable hypothesis high in recognition score from the set of hypotheses left when the end-determining condition is satisfied (S411).

[0051] In the speech recognition according to the present embodiment, the speech recognition process on one word requires a computation amount of 1,120,000 word lattice points on average. Compared with the 3,830,000 word lattice points on average in the related-art method, the computation amount can be considered reduced to nearly a quarter (about 29%). Herein, a word lattice point refers to a candidate not trimmed out (i.e., one that survived) within a frame when the narrowing process is done through an utterance from its beginning to its end in a frame synchronization beam search. Incidentally, the mean number of lattice points per word was determined by Equation (3).

$$a = \sum_{s=1}^{n} \sum_{f=s}^{e} N_f / U \qquad (3)$$

[0052] where a: the mean number of total lattice points per word, s: the frame number at the beginning of the utterance, e: the frame number at the end of the utterance, N_f: the number of lattice points in frame number f, and U: the total number of utterances.

[0053] That is, the total number of lattice points from the utterance beginning s to the utterance end e is summed over all utterances, and the sum is divided by the total number of utterances.
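The averaging described for Equation (3) amounts to the following Python sketch: lattice points are summed over the frames of each utterance, the per-utterance totals are summed, and the result is divided by the number of utterances. The function name and the small per-frame counts are hypothetical; only the two averages compared at the end are figures from the text.

```python
def mean_lattice_points(counts_per_utterance):
    """Mean number of lattice points per utterance.

    `counts_per_utterance` holds one list per utterance, each giving the number
    of surviving lattice points in every frame from the beginning to the end of
    that utterance.
    """
    total = sum(sum(frame_counts) for frame_counts in counts_per_utterance)
    return total / len(counts_per_utterance)


# Hypothetical data: two utterances with 3 and 2 frames.
mean = mean_lattice_points([[3, 2, 1], [4, 4]])

# Ratio of the averages reported for the embodiment and the related art.
ratio = 1_120_000 / 3_830_000
```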

[0054] Meanwhile, concerning the accuracy of speech recognition, the following result was obtained.

[0055] Using the five words used in the embodiment, experimental speech recognition was conducted on a total of 30 persons, 15 men and 15 women. According to the result, the related-art method had a recognition rate of 81.4% while the method of the invention had a recognition rate of 81.1%. Thus, the speech recognition by the method of the invention is substantially no different in accuracy from the related-art method.

[0056] According to the present invention, in a frame synchronization beam search, the accuracy of recognition can be secured by exactly computing, up to the word end, the hypotheses similar in pronunciation and high in score within a word, irrespective of the number of hypotheses. Furthermore, at a word end, the number of hypotheses arising from the connection of the words to follow is reduced. Accordingly, by narrowing the number of hypotheses, the computation amount can be effectively reduced while securing the accuracy of recognition. This increases the speed of speech recognition processing and improves the real-time capability.

What is claimed:
 1. A method for speech recognition comprising: a feature-amount extracting step for extracting a feature amount based on a frame of an input utterance; a storing step for determining whether a current processing frame is within or at an end of a candidate word previously registered, and storing the candidate word on the basis of a first hypothesis-storage determining criterion when within a word and on the basis of a second hypothesis-storage determining criterion when at a word end; a developing step for developing a hypothesis by extending utterance segments expressing the word when a stored candidate word is within a word and by joining a word to follow according to an inter-word connection rule when at a word end; an operating step of computing a similarity of between the feature amount extracted from the input utterance and a frame-based feature amount of an acoustic model of the developed hypothesis, and calculating a new recognition score from the similarity and a recognition score of the hypothesis of up to an immediately preceding frame calculated from the similarity; and a step of repeating the storing step, the developing step and the operating step until the processing frame becomes a last frame of the input utterance, and outputting, as a recognition result approximate to the input utterance, at least one of hypotheses in the order of higher recognition score due to processing the last frame.
 2. A method for speech recognition according to claim 1, wherein the first hypothesis-storage determining criterion is to select candidate words of within a predetermined threshold from a maximum value of the recognition score while the second hypothesis-storage determining criterion is to select a predetermined number of candidate words as counted from a candidate word maximum in the recognition score.
 3. An apparatus for speech recognition comprising: a feature-amount extracting section for extracting a feature amount based on a frame of an input utterance; a search control section for controlling to develop a hypothesis by extending based on utterance segments to express a word when the hypothesis is within a word and by joining a word to follow according to an inter-word connection rule previously determined when at a word end; a similarity computing section for computing a similarity of between a frame feature amount extracted from the input utterance and a frame feature amount of an acoustic model of the developed hypothesis; a search operating section for operating a recognition score from the similarity and recognition score of the hypothesis of up to an immediately preceding frame; a hypothesis determining section for determining whether a current processing frame is within a word or at a word end of the hypothesis and using the recognition score to select a candidate word according to a first determining criterion when within a word and to select a candidate word according to a second determining criterion when at a word end; a hypothesis storing device for storing a hypothesis determined to be stored; a word hypothesis registering device for registering as a new hypothesis the hypothesis and the recognition score; and a recognition result output section for continuing the frame-based process to a last of the input utterance and outputting at least one hypothesis in the order of higher recognition score.
 4. An apparatus for speech recognition according to claim 3, wherein the first determining criterion is to select candidate words of within a predetermined threshold from a maximum value of the recognition score while the second determining criterion is to select a predetermined number of candidate words as counted from a candidate word maximum in the recognition score.
 5. A program for executing: a feature-amount extracting step for extracting a feature amount based on a frame of an input utterance; a storing step for determining whether a current processing frame is within or at an end of a candidate word previously registered, and storing the candidate word on the basis of a first hypothesis-storage determining criterion when within a word and on the basis of a second hypothesis-storage determining criterion when at a word end; a developing step for developing a hypothesis by extending utterance segments expressing the word when a stored candidate word is within a word and by joining a word to follow according to an inter-word connection rule when at a word end; an operating step of computing a similarity of between the feature amount extracted from the input utterance and a frame-based feature amount of an acoustic model of the developed hypothesis, and calculating a new recognition score from the similarity and a recognition score of the hypothesis of up to an immediately preceding frame calculated from the similarity; and a step of repeating the storing step, the developing step and the operating step until the processing frame becomes a last frame of the input utterance, and outputting, as a recognition result approximate to the input utterance, at least one of hypotheses in the order of higher recognition score due to processing the last frame.
 6. A program according to claim 5, wherein the first hypothesis-storage determining criterion is to select candidate words of within a predetermined threshold from a maximum value of the recognition score while the second hypothesis-storage determining criterion is to select a predetermined number of candidate words as counted from a candidate word maximum in the recognition score.
 7. A computer-readable recording medium recording a program for executing: a feature-amount extracting step for extracting a feature amount based on a frame of an input utterance; a storing step for determining whether a current processing frame is within or at an end of a candidate word previously registered, and storing the candidate word on the basis of a first hypothesis-storage determining criterion when within a word and on the basis of a second hypothesis-storage determining criterion when at a word end; a developing step for developing a hypothesis by extending utterance segments expressing the word when a stored candidate word is within a word and by joining a word to follow according to an inter-word connection rule when at a word end; an operating step of computing a similarity of between the feature amount extracted from the input utterance and a frame-based feature amount of an acoustic model of the developed hypothesis, and calculating a new recognition score from the similarity and a recognition score of the hypothesis of up to an immediately preceding frame calculated from the similarity; and a step of repeating the storing step, the developing step and the operating step until the processing frame becomes a last frame of the input utterance, and outputting, as a recognition result approximate to the input utterance, at least one of hypotheses in the order of higher recognition score due to processing the last frame.
 8. A computer-readable recording medium recording a program according to claim 7, wherein the first hypothesis-storage determining criterion is to select candidate words of within a predetermined threshold from a maximum value of the recognition score while the second hypothesis-storage determining criterion is to select a predetermined number of candidate words as counted from a candidate word maximum in the recognition score.